Collecting Execution Statistics of Scientific Workflow on Hadoop YARN

Authors

  • Hannes Schuh
  • Ulf Leser
  • Marc Bux
  • Jörgen Brandt
Abstract

The development of computers, sensors, and other technical instruments has changed scientific research. Current scientific experiments often comprise a whole range of computational activities, e.g., capturing and storing huge amounts of data in databases or flat files, analyzing data with software, and visualizing data [1]. The complexity of these data-intensive experiments has led to new challenges. Developing and maintaining the physical infrastructure to store the data and execute the analysis pipelines is expensive [2]. Data provenance is also an important aspect: scientists need to ensure the origin of the results and the underlying data, and they need to share the knowledge and resulting information with the community so that other scientists can repeat and review the experiments.

Scientific workflows provide a means for representing and managing such analysis pipelines. An activity of the analysis pipeline is encapsulated in a workflow step (task). A scientific workflow is a directed acyclic graph (DAG) in which individual tasks are represented as nodes and edges represent data dependencies between tasks. A task can be local software or a remote (web) service call that transforms input data to output data. Task execution is constrained by data dependencies [3]. An abstract workflow is a high-level workflow description: scientists model an abstract workflow by specifying a set of individual tasks and the data dependencies between them [4]. Figure 2 shows an abstract workflow.

Scientific workflow management systems (SWfMSs) are used to manage and execute scientific workflows. In addition, they often support scientists in recording provenance information and statistics about the execution of the workflows. Provenance data traces the flow of data through the workflow tasks so that the results can be reasoned about. Statistics about the execution of workflows and tasks are useful not only for provenance questions: precise information about historical executions of a workflow can help the SWfMS execute the workflow faster or present progress and time-remaining estimates [5].
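To make the DAG model and the use of execution statistics concrete, the following is a minimal Python sketch, not the paper's implementation: all task names, runtimes, and the estimator are hypothetical assumptions. It represents a workflow as a map from each task to its data dependencies, derives a valid execution order, and computes a naive time-remaining estimate from historical task runtimes.

# A minimal sketch (hypothetical tasks and runtimes, not from the paper):
# a workflow as a DAG of tasks plus a naive time-remaining estimate.
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks whose outputs it consumes
# (its data dependencies).
dependencies = {
    "capture":   set(),
    "align":     {"capture"},
    "filter":    {"align"},
    "annotate":  {"align"},
    "visualize": {"filter", "annotate"},
}

# Task execution is constrained by data dependencies: any topological
# order of the DAG is a valid sequential schedule.
schedule = list(TopologicalSorter(dependencies).static_order())
print(schedule)  # e.g. ['capture', 'align', 'filter', 'annotate', 'visualize']

# Hypothetical statistics from historical executions: observed runtimes
# in seconds per task over earlier runs of the same workflow.
history = {
    "capture":   [120, 130, 125],
    "align":     [300, 280, 310],
    "filter":    [60, 55],
    "annotate":  [90, 95],
    "visualize": [30, 35],
}

def mean(xs):
    return sum(xs) / len(xs)

def estimate_remaining(pending, history):
    # Naive estimator: sum of mean historical runtimes of unfinished tasks.
    return sum(mean(history[t]) for t in pending)

# Suppose 'capture' and 'align' have finished; estimate the rest.
pending = [t for t in schedule if t not in {"capture", "align"}]
print(f"Estimated remaining time: {estimate_remaining(pending, history):.0f}s")

A real SWfMS would of course run independent tasks in parallel and use richer statistics (e.g., runtime models conditioned on input size); the sketch only illustrates the dependency constraint and the basic idea behind time-remaining estimation mentioned above.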


Related articles

Hi-WAY: Execution of Scientific Workflows on Hadoop YARN

Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today’s data-driven science. However, existing scientific workflow management systems (SWfMSs) are often limited to a single workflow language and lack adequate support for large-scale data analysis. On the other hand, current distributed dataflow systems are based on a...


SAASFEE: Scalable Scientific Workflow Execution Engine

Across many fields of science, primary data sets like sensor read-outs, time series, and genomic sequences are analyzed by complex chains of specialized tools and scripts exchanging intermediate results in domain-specific file formats. Scientific workflow management systems (SWfMSs) support the development and execution of these tool chains by providing workflow specification languages, graphic...


MEWSE: multi-engine workflow submission and execution on apache YARN

In this era of Big Data, designing a workflow to gain insights from the vast amount of data has become more complex. There are several frameworks that individually process batch and streaming data, but coordinating the jobs between the engines in a workflow creates a performance penalty and other issues. Current workflow systems typically run only on one engine and do ...


Cuneiform: a Functional Language for Large Scale Scientific Data Analysis

The need to analyze massive scientific data sets on the one hand and the availability of distributed compute resources with an increasing number of CPU cores on the other hand have promoted the development of a variety of languages and systems for parallel, distributed data analysis. Among them are data-parallel query languages such as Pig Latin or Spark as well as scientific workflow languages...


A Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints

One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...




Publication date: 2014